Abstract

The dynamics of attention in social media tend to obey power laws.
Attention concentrates on a relatively small number of popular items
and neglects the vast majority of content produced by the crowd.
Although popularity can be an indication of the perceived value of an
item within its community, previous research has hinted at the fact
that popularity is distinct from intrinsic quality. As a result,
content with low visibility but high quality lurks in the tail of the
popularity distribution. This phenomenon can be particularly evident
in the case of photo-sharing communities, where valuable photographers
who are not highly engaged in online social interactions contribute
high-quality pictures that remain unseen. We propose to use a computer
vision method to surface beautiful pictures from the immense pool of
near-zero-popularity items, and we test it on a large dataset of
creative-commons photos on Flickr. By gathering a large crowdsourced
ground truth of aesthetics scores for Flickr images, we show that our
method retrieves photos whose median perceived beauty score is equal
to that of the most popular ones, and whose average is lower by only
1.5%.


1 Introduction 


One of the common uses of online social media surely is to accrue
social capital by winning other people’s attention (Steinfield,
Ellison, and Lampe 2008; Smith and Giraud-Carrier 2010; Burke, Kraut,
and Marlow 2011; Bohn et al. 2014). The ever-increasing amount of
content produced by the crowd triggers emergent complex dynamics in
which different pieces of information have to compete for the limited
attention of the audience (Romero et al. 2011). In this process, only
a few individuals and the content they produce emerge and become
popular, while the vast majority of people are bound to a very limited
visibility, their contributions being rapidly forgotten (Cha et al.
2007; Sastry 2012). Such dynamics do not necessarily promote
high-quality content (Weng et al. 2012), possibly confining some
valuable information and expert users to the very tail of the
popularity distribution (Goel et al. 2010). This might cause a loss to
the community, first because tail contributors are likely to lose
engagement and churn out (Karnstedt et al. 2011), but also because
tail content is often less curated and difficult to find through
search (Baeza-Yates and Saez-Trumper 2013).


Previous work has focused extensively on studying the patterns of
popularity of social media users and of all sorts of online content,
trying to isolate the predictive factors of success (Suh et al. 2010;
Hong, Dan, and Davison 2011; Brodersen, Scellato, and Wattenhofer
2012; Khosla, Das Sarma, and Hamid 2014). However, considerably less
effort has been spent in finding effective ways to surface
high-quality content from the sea of forgetfulness of the popularity
tail. Finding valuable content in the pool of unpopular items is an
intrinsically difficult task because tail items i) are many,
outnumbering by orders of magnitude those with medium or high
popularity, ii) most often have low quality, making random sampling
strategies substantially ineffective, and iii) tend to be less
annotated and therefore more difficult to index.

We contribute to tackling these problems in the context of photo
sharing services. We use a computer vision method to surface beautiful
pictures among those with near-zero popularity, with no need for
additional metadata. Our approach is supervised and relies on features
developed in the field of computational aesthetics (Datta et al.
2006). To train our framework, we collect for the first time a large
ground truth of aesthetic scores assigned to Flickr images by
non-expert subjects via crowdsourcing. Differently from conventional
aesthetics datasets (Datta et al. 2006; Murray, Marchesotti, and
Perronnin 2012), our ground truth includes images with a wide spectrum
of quality levels and better reflects the taste of a non-professional
public, making it the ideal training set to classify web images.

When tested on nearly 9M creative-commons Flickr pictures, our method
is able to surface from the set of photos that received very low
attention (<5 favorites) a selection of images whose perceived beauty
is close to that of the most favorited ones, with the same median
value and an average value that is just 1.5% lower. Results are
consistent for images in four different topical categories and largely
outperform a random baseline, computer vision methods trained on
traditional aesthetics databases, and a state-of-the-art computer
vision method targeted to the prediction of image popularity (Khosla,
Das Sarma, and Hamid 2014).

We summarize our main contributions as follows:

• We build and make publicly available the largest ground truth of
  aesthetic scores for Flickr photos constructed so far, including
  10,800 pictures of 4 different topical categories and 60K judgments.
  We carefully designed the crowdsourcing experiment to account for
  the biases that can occur in a task that is characterized by a
  strong subjective component.

• We provide an analysis of ordinary people’s aesthetics perception of
  web images. We find that perceived beauty and popularity are
  correlated (ρ = 0.43) but the beauty scores of very popular items
  have higher variance than unpopular ones. We find that a
  non-negligible number of unpopular items are extraordinarily
  appealing.

• We propose a method to retrieve beautiful yet unpopular images from
  very large photo collections. Our approach works in a pure
  cold-start scenario as it needs as input only the visual information
  of the picture. Also, it overcomes the issue of sparsity (i.e., very
  few beautiful pictures hidden among very large amounts of mediocre
  images) with surprisingly high precision, being able to retrieve
  images whose perceived beauty is comparable to the top-rated photos.

After a review of the related work (§2), we touch upon the popularity
skew in Flickr (§3). We then describe the process of collection of the
aesthetics scores through crowdsourcing (§4). Next, we describe the
computer vision method we use to identify beautiful pictures and we
report the aesthetic prediction results in comparison with other
baselines (§5). Last, we show that our method can surface beautiful
photos from a large pool of non-popular ones (§6).


2 Related work 

Popularity Prediction. Being able to characterize and predict item
popularity in social media is an important, yet not fully solved task
(Hong, Dan, and Davison 2011). The possibility of predicting the
popularity of videos and pictures in social platforms like YouTube,
Vimeo, and Flickr has been explored extensively (Cha, Mislove, and
Gummadi 2009; Figueiredo, Benevenuto, and Almeida 2011; Brodersen,
Scellato, and Wattenhofer 2012; Ahmed et al. 2013). Multimodal
supervised approaches that combine metadata and computer vision
features have been used to predict photo popularity. Visual features
like coarseness and colorfulness predict well the number of favorites
in Flickr (San Pedro and Siersdorfer 2009) and, to some extent, the
number of reshares in Pinterest (Totti et al. 2014). The presence of
specific visual concepts in the image, such as human faces (Bakhshi,
Shamma, and Gilbert 2014), is a good predictor too. Recently, Khosla
et al. (Khosla, Das Sarma, and Hamid 2014) have made one of the most
mature contributions in this area, training an SVR model on both
visual content and social cues to predict the normalized view count on
a large corpus of Flickr images. While previous work tries to
understand why popular images are successful, we flip the perspective
to see if high-quality pictures hide in the long tail and to what
extent we are able to automatically surface them. This necessity is
also supported by the weak correlation between received attention and
perceived quality found in small image datasets (Hsieh, Hsu, and Wang
2014).

Popularity vs. Quality. Both social and computer scientists have
investigated the relation between popularity and intrinsic quality of
content. Items’ popularity is only partly determined by their quality
and it is largely steered by the early popularity distribution, often
with unpredictable patterns (Salganik, Dodds, and Watts 2006). Users’
limited attention drives the popularity persistence and virality of an
item more than its intrinsic appeal (Weng et al. 2012; Hodas and
Lerman 2012). A piece of content can attract attention because of many
factors including the favorable structural position of its creator in
a social network (Hong, Dan, and Davison 2011), the sentiment conveyed
by the message (Quercia et al. 2011), or the demographic (Suh et al.
2010) and geographic (Brodersen, Scellato, and Wattenhofer 2012)
composition of the audience. On video (Sastry 2012) or image (Zhong et
al. 2013) sharing platforms, the content that receives larger shares
of attention is often of niche topical interest. Adopting
community-specific behavioural norms can also increase popularity
returns. On Twitter, users who generate viral posts are those who
limit their tweets to a single topic (Cha et al. 2010). On Facebook,
communicating along weak ties is the key to spreading content (Bakshy
et al.). More generally, social activity, even in its most superficial
meaning (e.g., “poking”), can be a powerful attractor of popularity
(Vaca Ruiz, Aiello, and Jaimes 2014; Aiello et al. 2012).

Computational Aesthetics. Computational aesthetics is the branch of
computer vision that studies how to automatically score images in
terms of their photographic beauty. Datta et al. (2006) and Ke et al.
(2006) designed the first compositional features to distinguish
amateur from professional photos. Computational aesthetics researchers
have been developing dedicated discriminative visual features and
attributes (Nishiyama et al. 2011; Dhar, Ordonez, and Berg 2011),
generic semantic features (Marchesotti et al. 2011; Murray,
Marchesotti, and Perronnin 2012), topic-specific models (Luo and Tang
2008; Obrador et al. 2009) and effective learning frameworks (Wu, Hu,
and Gao 2011) to improve the quality of the aesthetics predictors.
Aesthetic features have also been used to infer higher-level
properties of images and videos, such as image affective value
(Machajdik and Hanbury 2010), image memorability (Isola et al. 2011),
video creativity (Redi et al. 2014b), and video interestingness (Redi
and Merialdo 2012; Jiang et al. 2013). To our knowledge, this is the
first time that image aesthetic predictors are used to expose
high-quality content among low-popularity images in the context of
social media.


Ground Truth for Image Aesthetics. Existing aesthetic ground truths
are often derived from photo contest websites, such as DPChallenge.com
(Ke, Tang, and Jing 2006) or Photo.net (Datta et al. 2006), where
(semi) professional photographers can rate the quality of their peers’
images. The average quality and style of the images in such datasets
is far higher than the typical picture quality in photo sharing sites,
making them unsuitable to train general aesthetic models. Hybrid
datasets (Luo, Wang, and Tang 2011) that add lower-quality images to
overcome this issue are also not good for training (Murray,
Marchesotti, and Perronnin 2012). In addition, social signals such as
Flickr interestingness² (Jiang et al. 2013) are often used as a proxy
for aesthetics in that type of dataset. However, no quantitative
evidence is given that either the Flickr interestingness or the
popularity of the photographers is a good proxy for image quality,
which is exactly the research question we address. Although
crowdsourcing constitutes a reliable way to collect ground truths on
image features (Redi and Povoa 2014), the only attempt to do it in the
context of aesthetics has been limited in scope (faces) and very
small-scale (Li et al. 2010).

Category | Tags
people   | people, face, portrait, groupshot
nature   | flower, plant, tree, grass, meadow, mountain
animals  | animal, insect, pet, canine, carnivore, butterfly, feline, bird, dog, peacock, bee, lion, cat
urban    | building, architecture, street, house, city, church, ceiling, cityscape, brick, tower, window, highway, bridge

Table 1: Set of machine tags included in each image category


3 Popularity in Flickr

Flickr is a popular social platform for image sharing. Users can
establish directed social links by “following” other users to get
updates on their activity. Users can label their own photos with
free-text tags and publish them in the photo pools of groups. Every
public photo can be marked as favorite or annotated with a textual
comment by any user in the platform. Flickr also maintains and
periodically updates the Explore page,¹ a showcase of interesting
photos.

The complex dynamics that attract attention towards Flickr images
revolve around all the above-mentioned mechanisms of social feedback
that, as in any other social network, tend to promote some items more
than others. As a result, the distribution of picture popularity,
usually measured by the number of favorites (Cha, Mislove, and Gummadi
2009), is very broad. Figure 1 (left) shows statistics on user and
image popularity computed over a random sample of 200M public Flickr
photos that have been favorited at least once. The distribution of the
mass of favorites over the photos is highly unequal (Gini coefficient
0.68): the number of favorites of the pictures in this sample spans
four orders of magnitude, with the majority of them having only one
favorite (52%). The same picture holds when aggregating the popularity
by users: some accumulate thousands of favorites while the vast
majority (~70%) gather fewer than ten.

Figure 1: (Left) Distribution of the number of favorites for Flickr
photos and users. (Right) Average number of comments, tags, and
uploads to group photo pools for photos with a fixed number of
favorites.

¹ https://www.flickr.com/explore
² The Flickr interestingness algorithm is secret, but it considers
some metrics of social feedback. For more details refer to
https://www.flickr.com/explore/interesting

Following the intuition behind the Infinite Monkey Theorem, the
unpopular users must be able to collectively produce a certain amount
of exceptionally valuable content just because of their substantial
number. More concretely, it is hard to believe that there is no
high-quality photo among 166M pictures with five favorites or fewer.
Estimating how many beautiful pictures lie in the popularity tail and
understanding how we can draw those out of the immense mass of
user-generated content are the main goals of this contribution.

One may think that one way to achieve this goal would be to leverage
different types of social feedback (e.g., comments). However,
unpopular items rarely receive social feedback. As displayed in Figure
1 (right), the number of comments, tags, and uploads in groups is
positively correlated with the number of favorites, with near-zero
favorite pictures having a near-zero amount of all the other metrics,
on average. Providing a method that does not rely on any type of
explicit feedback has therefore the advantage of being more general
and suitable for a cold-start scenario. For this reason, we rely on a
supervised computer vision method that we describe in §5 and whose
training set is collected as described in the next section.
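As a rough illustration of the inequality measure quoted earlier in this section, the sketch below computes a Gini coefficient over per-photo favorite counts. The function name `gini` and the toy counts are hypothetical and are not the paper's data; the text does not say which estimator was used, so this assumes the standard sorted-rank formula.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 = equal, near 1 = concentrated)."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending values x_1..x_n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Hypothetical toy sample: most photos have one favorite, a few have many.
favorites = [1] * 52 + [2] * 20 + [5] * 15 + [40] * 10 + [900] * 3
print(round(gini(favorites), 2))
```

A value close to 1 means that a handful of photos hold most of the favorites, which is the situation the section describes.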


4 Ground truth for image aesthetics 

We build a ground truth for aesthetics from a 9M random sample of the
Creative Commons Flickr Images dataset.⁴ We collect the annotations
using CrowdFlower,⁵ a large crowdsourcing platform that distributes
small, discrete tasks to online contributors. Next we describe how we
selected the images for our corpus (§4.1), how we ran the
crowdsourcing experiment (§4.2), and the results on the beauty
judgments we got from it (§4.3).


4.1 Definition of the image corpus

To help the contributors in the assessment of image beauty, we build a
photo collection that i) presents topically coherent images and ii)
represents the full popularity spectrum, thus ensuring a diverse range
of aesthetic values.

Topical Coherence. Different picture categories can achieve the same
aesthetic quality driven by different criteria (Luo, Wang, and Tang
2011). To make sure that contributors use the same evaluation
standard, we group the images in classes of coherent subject
categories. To do that, we use Flickr machine tags, namely tags
assigned by a computer vision classifier trained to recognize the type
of subject depicted in a photo (e.g., a bird or a tree) with a certain
confidence level. We manually group the most frequent machine tags in
topically coherent macro-groups, coming up with 4 final categories:
people, nature, animals, and urban. We only consider the pictures
associated with high-confidence machine tags (>0.9). Moreover, we
manually clean the final photo selection by replacing the few
instances that suffered from machine tag misclassification. The full
list of machine tags per category is reported in Table 1.

⁴ http://bit.ly/yfcc100m
⁵ http://www.crowdflower.com

Figure 2: Screenshot of the CrowdFlower job: instruction examples
(left) and voting task (right).

Full Popularity Range. Within each category, we are interested in
assessing the perceived beauty of photos with different popularity
levels. To do so, we identify three popularity buckets obtained by
logarithmic binning over the range of the number of favorites f. We
refer to them as tail (f < 5), torso (5 < f < 45), and head (f > 45).
The tail of the distribution contains 98% of the photos, whereas the
torso and head contain 1.6% and 0.4% respectively. We randomly sample,
within each category, 1000 images from the tail and 1000 from the
torso. Because of the reduced number of most popular pictures, we do
not randomly sample the head bucket but consider the top 500 instead.
Images from such diverse popularity levels are also likely to take a
wide range of aesthetic values, thus ensuring aesthetic diversity in
our corpus, which is very important to get reliable beauty judgments
(Redi et al. 2014a).
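The bucketing and sampling procedure above can be sketched as follows. This is only an illustration under stated assumptions: the names `bucket` and `sample_corpus` are hypothetical, the input is assumed to be a list of `(photo_id, favorites)` pairs for one category, and photos with exactly 5 or 45 favorites are assigned to the torso because the text leaves the boundary values open.

```python
import random


def bucket(favorites, low=5, high=45):
    """Map a favorite count to a popularity bucket.

    Assumption: boundary values (exactly 5 or 45 favorites) go to the torso.
    """
    if favorites < low:
        return "tail"
    if favorites <= high:
        return "torso"
    return "head"


def sample_corpus(photos, n_random=1000, n_top=500, seed=0):
    """Select photos per bucket; photos is a list of (photo_id, favorites)."""
    rng = random.Random(seed)
    groups = {"tail": [], "torso": [], "head": []}
    for photo_id, favorites in photos:
        groups[bucket(favorites)].append((photo_id, favorites))
    selection = {}
    # Tail and torso are large, so draw a uniform random sample from each.
    for name in ("tail", "torso"):
        pool = groups[name]
        selection[name] = rng.sample(pool, min(n_random, len(pool)))
    # The head is small, so take the most favorited photos instead of sampling.
    by_popularity = sorted(groups["head"], key=lambda p: p[1], reverse=True)
    selection["head"] = by_popularity[:n_top]
    return selection
```

Fixing the seed keeps the random draw reproducible, which matters if the same corpus has to be rebuilt later.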


4.2 CrowdFlower experiment 

Crowdsourcing tasks are influenced by a variety of human factors that
are not always easy to control (Mason and Suri 2012). However,
platforms like CrowdFlower offer advanced mechanisms to tune the
annotation process and enable the best conditions to get high-quality
judgments. To facilitate the reproducibility of our experiment, next
we report the main setup parameters.


Task interface and setup. The task consists of looking at a number of
images and evaluating their aesthetic quality. At the top of the page
we report a short description of the task and we ask “How beautiful is
this picture?”. The contributor is invited to judge the intrinsic
beauty of an image and not the appeal of its subject; high-quality,
artistic pictures that depict a non-conventionally beautiful subject
(e.g., a spider) should be marked as beautiful and vice versa.
Screenshots of the CrowdFlower job interface are shown in Figure 2.


Score | Label        | Description
1     | Unacceptable | Extremely low quality, out of focus, underexposed, badly framed images
2     | Flawed       | Low quality images with some technical flaws (slightly blurred, slightly over/underexposed, incorrectly framed) and without any artistic value
3     | Ordinary     | Standard quality images without technical flaws (subject well framed, in focus, and easily recognizable) and without any artistic value
4     | Professional | Professional-quality images (flawless framing, focus, and lighting) or with some artistic value
5     | Exceptional  | Very appealing images, showing both outstanding professional quality (photographic and/or editing techniques) and high artistic value

Table 2: Description of the five-level aesthetic judgment scale


Although several approaches and rating scales can be used to get
quality feedback (Fu et al. 2014), we use the 5-point Absolute
Category Rating (ACR) scale, ranked from “Unacceptable” to
“Exceptional”, as it is a good way to collect aesthetic preferences
(Siahaan, Redi, and Hanjalic 2013). To help the annotators in their
assessment, two example images and a textual description of each grade
are provided (see Figure 2 and Table 2). The examples are Flickr
images that have been unanimously judged by three independent
annotators to be clear representatives of that beauty grade. Below the
examples, each page contains 5 randomly selected images (units of work
in CrowdFlower jargon), each followed by the radio buttons to cast the
vote. The random selection of images allows us to mix pictures from
different popularity ranges in the same page, thus offering the users
an easier context for comparison (Fu et al. 2014). We show all the
images with approximately the same (large) size because image size can
skew the perception of image quality (Chu, Chen, and Chen 2013).

Each photo receives at least 5 judgments, each one by a different
independent contributor. Each contributor can submit a maximum of 500
judgments, to prevent the predominance of a small group of workers.
Contributors are geographically limited to a set of specific countries
to ensure higher cultural homogeneity in the assessment of image
aesthetics (Hagen and Jones 1978). Only contributors with an excellent
track record on the platform (responsible for 7% of monthly
CrowdFlower judgments overall) have been allowed. We also banned
workers that come from external crowdsourcing channels that have a
ratio of trusted/untrusted users lower than 0.9.


Quality control. Test Questions (also called Gold Standard) are used
to test and track the contributors’ performance and filter out bots or
unreliable contributors. To access the task, workers are first asked
to correctly annotate 6 out of 8 Test Questions in an initial Quiz
Mode screen, and their performance is tracked throughout the task with
Test Questions randomly inserted in every task, disguised as normal
units. To support the learning process of a contributor, we tag each
Test Question with an explanation that pops up in case of misjudgment
(e.g., “excellent combination of framing, lighting, and colors
resulting in an artistic image, visually very appealing” is one of the
descriptions for a highly rated item).

Category | Units | Judgments | Workers | Countries | Trust
people   | 2500  | 12725     | 141     | 13        | 0.843
nature   | 2500  | 15054     | 178     | 14        | 0.841
animals  | 2500  | 13269     | 117     | 13        | 0.80
urban    | 2500  | 13213     | 111     | 13        | 0.839

Table 3: General statistics on the crowdsourcing experiment

Category | Matching% | Fleiss’ κ | Cronbach’s α
people   | 68.82     | 0.38      | 0.74
nature   | 72.65     | 0.27      | 0.71
animals  | 69.37     | 0.35      | 0.8
urban    | 73.13     | 0.38      | 0.8

Table 4: Measures of judgment agreement

To build the set of Test Questions, we first collected about 200
candidate images from different online sources including Flickr, web
repositories, aesthetics corpora (Murray, Marchesotti, and Perronnin
2012), and relevant photos retrieved by the main image search engines.
Three independent editors manually annotated the candidate sets with a
beauty score. For each category, we ran a small-scale pilot
CrowdFlower experiment to consolidate the editors’ assessment, taking
into account the micro-workers’ feedback. This process led us to mark
some of the Test Questions with two contiguous scores. After this
validation step, we identified the set of 100 images with the highest
agreement that span the full range of grades.


4.3 Results 

We run a separate job for each topical category. Table 3 summarizes
the number of units annotated, judgments submitted, distinct
participants, and the average accuracy (trust) on Test Questions of
the contributors. Each unit can receive more than 5 independent
judgments; in the case of nature we collected 20% more judgments than
for the other categories. On average, more than 140 contributors
geographically distributed in 13 countries and characterized by a high
level of trustworthiness participated in each experiment.


Figure 3: Relation between popularity (number of favorites) and
crowdsourced beauty scores for 10,800 Flickr pictures.

Inter-rater agreement. To assess the quality of the collected data, we
measure the level of agreement between annotators. Table 4 shows a set
of standard measures to evaluate the inter-rater consistency.
Matching% is the percentage of matching judgments per item. Across
categories the agreement is solid, with an average of 70%. However,
the ratio of matching grades does not capture entirely the extent to
which agreement emerges. In fact, the task is inherently subjective
and in some cases the quality of an image naturally converges to an
intermediate level. We therefore compute Fleiss’ κ, a statistical
measure for assessing the reliability of the agreement between a fixed
number of raters. Since Fleiss’ κ is used to evaluate agreement on
categorical ratings, it is not directly applicable to our task. We
therefore binarize the task and assign to each judgment either a
Beautiful or a NotBeautiful label, according to the score being
respectively greater or lower than the median. Consistently, Fleiss’ κ
shows a fair level of agreement. To further evaluate inter-participant
consistency we computed Cronbach’s α, which has been extensively
adopted in the context of assessing inter-rater agreement on
aesthetics tasks (Siahaan, Redi, and Hanjalic 2013). For all
categories, Cronbach’s coefficient lies in the interval 0.7 < α < 0.9,
which is commonly defined as a Good level of consistency.
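The two agreement coefficients discussed above can be sketched with the textbook formulas, as below. This is a minimal illustration and not the authors' code. It assumes every item has the same number of judgments, that column k of `score_rows` holds the k-th judgment slot (in the real experiment different workers rate different photos), and that a score equal to the median counts as NotBeautiful, which the text does not specify. All function names are hypothetical.

```python
from statistics import pvariance


def binarize(score, median):
    """Binary label for one judgment (assumption: ties count as NotBeautiful)."""
    return "Beautiful" if score > median else "NotBeautiful"


def fleiss_kappa(label_rows):
    """label_rows: one list of categorical labels per item, equal lengths."""
    n_items = len(label_rows)
    n_raters = len(label_rows[0])
    categories = sorted({label for row in label_rows for label in row})
    counts = [[row.count(c) for c in categories] for row in label_rows]
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall proportion of each category.
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in p_cats)
    if p_expected == 1.0:
        return 1.0  # every judgment carries the same label
    return (p_bar - p_expected) / (1.0 - p_expected)


def cronbach_alpha(score_rows):
    """score_rows: one list of numeric scores per item; column k = rater slot k."""
    k = len(score_rows[0])
    columns = list(zip(*score_rows))
    slot_variances = sum(pvariance(col) for col in columns)
    total_variance = pvariance([sum(row) for row in score_rows])
    return k / (k - 1) * (1.0 - slot_variances / total_variance)
```

In this sketch one would first map each 1-5 judgment through `binarize` and pass the resulting labels to `fleiss_kappa`, while `cronbach_alpha` works directly on the numeric scores.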

Beauty judgements. The Spearman correlation ρ between the number of
favorites and the average beauty score is 0.43. Although the
correlation is substantial, the variability of perceived beauty for
each popularity value is very high. In Figure 3 we plot the beauty
score against the number of favorites, for each photo. Zero-popularity
images span the whole aesthetics judgment scale, from 1 to 5, and most
popularity levels have photos within the [2.5, 5] beauty range. Very
low scores (1, 2) are rare. This plot confirms our initial motivation
as it shows instances of unpopular yet beautiful photos, as well as a
good portion of very popular photos with average or low quality.
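For readers who want to reproduce this kind of rank correlation on their own data, a minimal sketch follows. It computes Spearman's ρ as the Pearson correlation of average ranks, which handles the many ties that favorite counts produce. The function names are hypothetical and no values from the paper are reproduced here.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))
```

Calling `spearman(favorites, mean_scores)` on two aligned lists (one entry per photo) gives the coefficient; both list names are placeholders.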

Results on the distribution of judgments across categories and
popularity buckets are summarized in Figure 4. As expected, the head
bucket shows the highest average score, followed by the torso and the
tail. With the exception of the people category, the standard
deviation follows the same trend: higher popularity corresponds to
higher disagreement. This might be due to the fact that viewers are
likely to largely agree on objective elements that make an image
non-appealing, such as technical flaws (e.g., bad focus), but on the
other hand they might not agree on what makes an image exceptionally
beautiful, which can be a more subjective characteristic. Given that
the more popular a photo is, the more appealing it tends to be, this
phenomenon can partly explain the inconsistent agreement level among
popularity buckets. Across categories we observe that animals images
have the highest average quality perception (3.49 ± 0.75) while the
remaining categories show a mean around 3.31.

Figure 4: Distribution of ratings across categories and popularity
buckets. The red lines and their surrounding areas represent the
average and standard deviation.


5 Image Aesthetics

Having collected a ground truth of crowdsourced beauty judgements, we
now design a computational aesthetic framework to surface beautiful,
unpopular pictures. Our method is based on regressed compositional
features, namely visual features that are specifically designed to
describe how much an image fulfills standard photographic rules. We
design our framework as follows:


Visual Features. We design a set of visual features to expose image
photographic properties. More specifically, we compose a
47-dimensional feature vector with the following descriptors:

• Color Features. Color patterns are important cues to understand the
  aesthetic and affective value of a picture. First, we compute a
  Contrast metric, which provides information about the
  distinguishability of colors based on the magnitude of the average
  luminance:

  \[ \mathrm{Contrast} = \frac{Y_{\max} - Y_{\min}}{\bar{Y}} \tag{1} \]

  where $Y_{\max}$, $Y_{\min}$, and $\bar{Y}$ correspond respectively
  to the maximum, minimum, and average of the luminance channel. We
  then extract the average of the Hue, Saturation, Brightness (H, S,
  V) channels, computed both on the whole image and on the inner
  quadrant resulting after a 3×3 division of the image, similar to
  previous approaches (Datta et al. 2006). By combining average
  Saturation (S) and Brightness (V) values, we also extract three
  indicators of emotional dimensions, Pleasure, Arousal and Dominance,
  as suggested by previous work on affective image analysis (Machajdik
  and Hanbury 2010):

  \[
  \begin{aligned}
  \mathrm{Pleasure}  &= 0.69\,V + 0.22\,S \\
  \mathrm{Arousal}   &= -0.31\,V + 0.60\,S \\
  \mathrm{Dominance} &= 0.76\,V + 0.32\,S
  \end{aligned}
  \tag{2}
  \]

  After quantizing the HSV values, we also collect the occurrences of
  12 Hue bins, 5 Saturation bins, and 3 Brightness bins in the HSV
  Itten Color Histograms. Finally, we compute Itten Color Contrasts,
  i.e., the standard deviation of the H, S and V Itten Color
  Histograms (Machajdik and Hanbury 2010).


• Spatial Arrangement Features. Spatial arrangement of objects, shapes
  and people plays a key role in the shooting of good photographs
  (Freeman 2007). To analyze the spatial layout in the scene, first,
  we resize the image to a square matrix $I$, and we compute a
  Symmetry descriptor based on the difference of the Histograms of
  Oriented Gradients (HOG) (Dalal and Triggs 2005) between the left
  half of the image and its flipped right half:

  \[ \mathrm{Symmetry} = \phi(I^{l}) - \phi\big((I \cdot J)^{r}\big) \tag{3} \]

  where $\phi$ is the HOG operation, $I^{l}$ is the left half of the
  image, and $(I \cdot J)^{r}$ is the flipped right half of the image,
  $J$ being the anti-diagonal identity matrix that imposes the
  left-right flipping of the columns in $I$. We also consider the Rule
  of Thirds, a photographic guideline stating that the important
  compositional elements of a picture should lie on four ideal lines
  (two horizontal, two vertical) that divide it into nine equal parts
  (the thirds). To model it, from the resized image $I$ we compute a
  saliency matrix (Hou and Zhang 2007), exposing the image regions
  that are more likely to grasp the attention of the human eye. We
  then analyze the distribution of the salient zones across the image
  thirds by retaining the average saliency value for each third
  subregion.


• Texture Features. We describe the overall complexity and
homogeneity of an image by computing Haralick's
features (Haralick 1979), namely the Entropy, Energy,
Homogeneity, and Contrast of the Gray-Level Co-occurrence
Matrices.
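
As a rough illustration of some of the color and composition descriptors listed above, the following Python sketch computes the Pleasure, Arousal, and Dominance indicators of Eq. 2, quantized H, S, V histograms with their standard deviations, and the average saliency of each cell in a 3x3 grid. The function names, the uniform bin edges, and the plain (non-circular) hue average are assumptions made for illustration, and the saliency matrix is taken as an input rather than computed with the method of Hou and Zhang.

```python
import colorsys
from statistics import mean, pstdev


def pad_indicators(mean_s, mean_v):
    """Pleasure, Arousal, Dominance from average Saturation and Brightness (Eq. 2)."""
    return {
        "pleasure": 0.69 * mean_v + 0.22 * mean_s,
        "arousal": -0.31 * mean_v + 0.60 * mean_s,
        "dominance": 0.76 * mean_v + 0.32 * mean_s,
    }


def quantized_histogram(values, n_bins):
    """Count values in [0, 1] over n_bins equal-width bins (uniform bins are an assumption)."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    return counts


def color_features(rgb_pixels):
    """rgb_pixels: list of (r, g, b) tuples with components in [0, 1]."""
    hsv = [colorsys.rgb_to_hsv(r, g, b) for r, g, b in rgb_pixels]
    hues, sats, vals = zip(*hsv)
    histograms = {
        "H": quantized_histogram(hues, 12),  # 12 Hue bins
        "S": quantized_histogram(sats, 5),   # 5 Saturation bins
        "V": quantized_histogram(vals, 3),   # 3 Brightness bins
    }
    # Contrast taken as the standard deviation of each histogram.
    contrasts = {ch: pstdev(hist) for ch, hist in histograms.items()}
    return {
        "mean_hsv": (mean(hues), mean(sats), mean(vals)),
        "pad": pad_indicators(mean(sats), mean(vals)),
        "histograms": histograms,
        "contrasts": contrasts,
    }


def thirds_saliency(saliency):
    """Average saliency in each cell of a 3x3 grid; saliency is a 2D list of numbers."""
    rows, cols = len(saliency), len(saliency[0])
    cells = []
    for i in range(3):
        for j in range(3):
            r0, r1 = i * rows // 3, (i + 1) * rows // 3
            c0, c1 = j * cols // 3, (j + 1) * cols // 3
            block = [saliency[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(mean(block))
    return cells
```

The returned values would be concatenated with the remaining descriptors to form the visual feature vector of an image.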


Groundtruth. We use our crowdsourced groundtruth as the
main source of knowledge for our supervised framework.
Since topic-specific aesthetic models have been shown to
perform better than general frameworks (Luo, Wang, and
Tang 2011), we keep the division of the ground truth into
semantic categories (people, urban, animals, nature), and
learn a separate, topic-specific aesthetic model for each
category.

Learning Framework. We train category-specific models
using Partial Least Squares Regression (PLSR), a very
effective prediction framework for visual pattern analysis (?).
For each semantic category, PLSR learns a set of
regression coefficients, one per dimension of the visual feature
vector, by combining principles of least-squares regression
and principal component analysis. Each category-specific
group of regression coefficients constitutes a separate
aesthetic model.
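
To make the learning step more concrete, here is a minimal single-response PLS sketch (the NIPALS-style PLS1 variant) in plain Python. It is not the implementation used in this work: the function names, the tolerance, and the choice of PLS1 are assumptions for illustration. Each component is a direction in feature space chosen for its covariance with the score, after which both the features and the score are deflated.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_pls1(X, y, n_components, tol=1e-12):
    """Fit a single-response PLS model. X: list of feature rows, y: list of scores."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered features
    yc = [v - y_mean for v in y]                                # centered scores
    components = []
    for _ in range(n_components):
        # Weight vector: direction of maximal covariance with the residual scores.
        w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
        norm = dot(w, w) ** 0.5
        if norm < tol:
            break  # nothing left to explain
        w = [v / norm for v in w]
        t = [dot(row, w) for row in Xc]  # latent scores of the training images
        tt = dot(t, t)
        if tt < tol:
            break
        load = [sum(Xc[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = dot(yc, t) / tt
        # Deflate features and scores before extracting the next component.
        Xc = [[Xc[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        yc = [yc[i] - q * t[i] for i in range(n)]
        components.append((w, load, q))
    return {"x_mean": x_mean, "y_mean": y_mean, "components": components}


def predict_pls1(model, x):
    """Predict the score of one feature vector by replaying the deflation steps."""
    xc = [v - m for v, m in zip(x, model["x_mean"])]
    y_hat = model["y_mean"]
    for w, load, q in model["components"]:
        t = dot(xc, w)
        y_hat += q * t
        xc = [v - t * l for v, l in zip(xc, load)]
    return y_hat


def regression_coefficients(model):
    """One coefficient per feature, read off the (linear) predictor."""
    base = predict_pls1(model, model["x_mean"])
    coefs = []
    for j in range(len(model["x_mean"])):
        probe = list(model["x_mean"])
        probe[j] += 1.0
        coefs.append(predict_pls1(model, probe) - base)
    return coefs
```

With as many components as features and a full-rank feature matrix, this reduces to ordinary least squares; using fewer components gives the dimensionality reduction that PLSR is chosen for.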


Prediction and Surfacing. We apply the models to
automatically assess the aesthetic value of new, unseen images
(i.e., images that do not belong to the training set). To do so,
we use the regression coefficients in a linear combination
with the features of each image, thus obtaining the predicted
aesthetic score for that image.
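
The prediction step can be sketched in a few lines of Python. The data layout (dictionaries with `id`, `category`, and `features` keys), the function names, and the explicit intercept term are assumptions made for this example; the text above only specifies a linear combination of coefficients and features, applied with the model of the image's own category.

```python
def aesthetic_score(coefficients, intercept, features):
    """Linear combination of regression coefficients and image features."""
    return intercept + sum(c * f for c, f in zip(coefficients, features))


def rank_by_category(images, models):
    """Score each image with its category model and rank within each category.

    images: list of dicts with 'id', 'category', 'features' (hypothetical layout).
    models: dict mapping category -> (coefficients, intercept).
    Returns: dict mapping category -> image ids, highest predicted score first.
    """
    scored = {}
    for img in images:
        coefficients, intercept = models[img["category"]]
        score = aesthetic_score(coefficients, intercept, img["features"])
        scored.setdefault(img["category"], []).append((score, img["id"]))
    return {
        category: [image_id for _, image_id in
                   sorted(items, key=lambda item: item[0], reverse=True)]
        for category, items in scored.items()
    }
```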

We use our aesthetic models for two types of experiments. 
First, to study the performance of our framework against 
similar approaches, we run a small-scale experiment where 
the task is to predict the aesthetic scores of the crowdsourced 
groundtruth. We then apply the aesthetic models to rank a 
very large set of images in terms of beauty, with the aim of 
surfacing the most appealing non-popular pictures. 


6 Beauty Prediction from and for the Crowd 

To test the power of our aesthetics predictor, we run a
small-scale experiment on the crowdsourced dataset. We look at
how much the aesthetic scores assigned by our framework
correlate with the actual beauty scores assigned by the
workers, and evaluate the performance of our algorithm against
other ranking strategies.

          CrowdBeauty   MIT popularity   TraditionalBeauty   Random
animals   0.54          0.37             0.251                0.001
urban     0.46          0.27             0.12                 0.003
nature    0.34          0.29             0.11                -0.003
people    0.42          0.31             0.27                -0.008

Table 5: Spearman correlation between the crowdsourced beauty
judgments and the scores given by different methods on the images
of the test set.


Baselines. We compare our method with the following two 
baselines: 


Popularity Predictor: What if a popularity predictor was
enough to assess image beauty? To check that, we
compare our algorithm with an established content-based
image popularity predictor. For each picture in our ground
truth, we query the MIT popularity API,¹ a recently
proposed framework that automatically predicts image
popularity scores (in terms of normalized view count)
given visual cues, such as colors and deep learning
features (Khosla, Das Sarma, and Hamid 2014).


Traditional Aesthetic Predictor: What if existing aesthetic
frameworks were general enough to assess crowdsourced
beauty? As mentioned in §5, our models are specifically
trained on the crowdsourced dataset, i.e., a groundtruth of
images generated and voted on by average users. On the other
hand, existing aesthetic predictors are generally trained on
semi-professional images evaluated by professional
photographers. To justify our dataset collection effort, we show
how a classifier trained on traditional aesthetic datasets
performs in comparison with our method. We design this
baseline with the same structure and features as our proposed
method, but, instead of using our crowdsourced ground
truth, we train on the AVA dataset (Murray, Marchesotti,
and Perronnin 2012). Similar to our method, we build one
category-specific model for each semantic category. This is
achieved by training each category-specific model with the
subset of AVA pictures in the corresponding category. We
infer the category according to the tags attached to each image,
as proposed for many topic-specific aesthetic models (Luo
and Tang 2008; Obrador et al. 2009).


Experimental Setup. To evaluate our framework, for each
semantic category we retain 800 images for testing and the rest
for training. For training, we use images from all three
popularity ranges (tail, torso, head). For testing, we consider
non-popular images only, as our main purpose is to detect
"hidden" beautiful pictures with a low number of favorites. For
both training and testing, we use all 47 visual features,
which the PLSR algorithm reduces to 15 components.

¹ http://popularity.csail.mit.edu

[Figure 5 panels: people, urban, nature, animals]

Figure 5: Average crowdsourced beauty score of photos in
different popularity buckets and of photos surfaced by the
aesthetics predictors.

We then score the images in the test set using the
output of our framework, the MIT popularity scores, the output
of the traditional aesthetic classifier, and a random baseline.
Next, we evaluate the performance of the three algorithms
in terms of the Spearman Correlation Coefficient between the
scores predicted on the test set by each model and the
actual votes from the crowd. This metric gauges the ability of
each model to replicate human aesthetic preferences on
non-popular Flickr images.
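
For reference, the Spearman coefficient used in this evaluation is the Pearson correlation computed on ranks. The sketch below is a plain-Python version that assigns average ranks to ties; the function names are illustrative, and it assumes that neither input is constant (otherwise the denominator is zero).

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(predicted, observed):
    """Spearman rank correlation between predicted scores and crowd votes."""
    rp, ro = average_ranks(predicted), average_ranks(observed)
    n = len(rp)
    mp, mo = sum(rp) / n, sum(ro) / n
    cov = sum((a - mp) * (b - mo) for a, b in zip(rp, ro))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_o = sum((b - mo) ** 2 for b in ro)
    return cov / (var_p * var_o) ** 0.5
```

Because only ranks matter, the predicted scores do not need to be on the same scale as the 1-to-5 crowd votes for the comparison to be meaningful.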

Experimental Results. The correlation between the beauty
scores assigned by the micro-workers on the test set and our
proposed algorithm (CrowdBeauty in the following) is
substantial for all categories, as shown in Table 5. In
particular, the most predictable class is the animals category,
followed by urban. The higher performance in these two
cases might be due to the smaller range of poses and
compositional layouts available to the photographer when
shooting pictures of subjects belonging to these particular
categories. As expected, the results of the random approach are
uncorrelated with the beauty scores. For all
semantic categories, we see that our method outperforms both
the popularity predictor (MIT Popularity) and the traditional
aesthetic classifier (TraditionalBeauty), showing the
usefulness of building a dedicated ground truth and aesthetic
classifier to score non-popular web images.

7 Surfacing Beautiful Hidden Photos 

Having provided some evidence about the effectiveness of 
our approach, we apply it in a more realistic scenario where 
the goal is to surface beautiful images from a large number 
of non-popular Flickr pictures. 

To do so, we compute the features described in §5 on all
the 9M images of our large-scale categorized dataset of
creative commons Flickr images. We apply the
category-specific model on the pictures in each topical
category separately and rank the pictures by their predicted
aesthetics scores. For the sake of comparison, we repeat the
same procedure with the traditional aesthetic models
(TraditionalBeauty) used as baseline in §6, and rank the pictures
in terms of the predicted beauty scores. We do not consider here the
MIT Popularity baseline as its scores can only be retrieved
via API with a certain request delay, which is not practical
for a very large set of images.

To quantify how appealing the images surfaced with our
approach are, we implemented an additional crowdsourcing
experiment in which images with different popularity levels
are evaluated against the top-ranked images according to our
models and the traditional aesthetic model. We replicated
the same experimental settings described in Section 4, and
we used a corpus composed of 200, 200, and 100 images
from the tail, torso, and head of the popularity distribution,
respectively, and we added the top 200 images from the
TraditionalBeauty and CrowdBeauty rankings. For consistency,
we maintained the same proportion of items per class we
used in the previous experiments, but with a smaller sample
that focuses only on the top-ranked beautiful images.

Figure 6: Average beauty of the top n pictures ranked by popularity (in tail, torso, and head buckets) and by the predicted beauty scores.

Figure 5 shows the average beauty score for each category
and bucket combination. Consistently across categories, the
perceived beauty of the CrowdBeauty images is comparable
to that of the most favorited photos. In fact, for nature and animals
we observe an average increment of 0.9% and 1.3% with
respect to the most popular items, and for urban and people a
decrease of 2% and 7%, respectively. With the exception
of people, the median of the perceived beauty score goes
up from 3 to 4 when CrowdBeauty is adopted in place of
TraditionalBeauty. This behavior confirms how important it is
for this task to train an aesthetic predictor with a reliable
ground truth.

The study of the average behavior of the beauty predictors
does not show what happens if we consider only the head of
the ranking. For some applications this could be relevant, e.g.,
recommender systems suggest the top n most relevant items
for a user. To this end, it is interesting to evaluate the
perceived beauty of the topmost images. Figure 6 shows for
each category how the average beauty score varies at cutoffs
n ∈ [5, 100]. Highly popular items have a consistent
behavior across categories: items at the top of the ranking
are perceived as very appealing, and the quality drops and
stabilizes quickly after n = 20. In general, after an
initial variation, CrowdBeauty stabilizes above the tail, torso,
and TraditionalBeauty curves. While urban is almost stable
across all cutoffs, nature and animals start with lower-quality
items and rapidly jump to higher values. A different case
is the people category, where the top ten images have a very
high score, which then drops after n = 20.
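
The cutoff analysis amounts to averaging the crowd scores of the first n images of each ranking. A minimal Python sketch follows; the function name and the input layout (crowd scores already sorted by the ranking under study) are assumptions made for illustration.

```python
def average_at_cutoffs(scores_in_rank_order, cutoffs):
    """Mean crowd score of the top-n images for each cutoff n.

    scores_in_rank_order: crowd beauty scores, best-ranked image first.
    cutoffs: iterable of positive n values; cutoffs beyond the list length are skipped.
    """
    return {
        n: sum(scores_in_rank_order[:n]) / n
        for n in cutoffs
        if 0 < n <= len(scores_in_rank_order)
    }


# Example with made-up scores: curves such as those in Figure 6 would use
# cutoffs between 5 and 100, e.g. range(5, 101, 5).
curve = average_at_cutoffs([5, 4, 4, 3, 2, 1], [2, 4, 10])
```

Computing one such curve per ranking (tail, torso, head, TraditionalBeauty, CrowdBeauty) and per category yields the comparison described above.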

Some examples of highly ranked images surfaced by our
algorithm, alongside the least and most favorited
pictures, are shown in Table 6.

8 Discussion and Conclusions 

Applications and future work. The ability to rank by
aesthetic appeal images that are nearly indistinguishable in
terms of user feedback has immediate
applications. First, it promotes the democratization of
photo sharing platforms, creating an opportunity to balance
the visibility of popular and beautiful photos with those that
are as beautiful but have less social exposure. As a
proof-of-concept, we envision a new Flickr Beauty Explorer page
that surfaces the most beautiful yet unpopular photos of the
month to complement the classic Flickr Explorer, which
contains photos with very high social feedback. Our method can
be used to bring valuable but unengaged users into the active
core of the community by channeling other people's attention
towards them. An extension to this work could be to use
the aggregation of photo quality over users to spot hidden
talents and devise incentive mechanisms to prevent them from
churning. Furthermore, our method increases the payoff of the
service provider by uncovering valuable content, exploitable
for promotion, advertising, mashup, or any other
commercial service, that would have been nearly useless otherwise.
It would also be interesting to study the effect of aesthetic
reranking on the head of the popularity distribution, or on
images relevant to a specific query.

Table 6: Samples of images from tail and head popularity buckets,
compared to the images surfaced by our approach.

Limitations. Our approach comes with a few limitations,
mainly introduced by the computer vision method we use.

Figure 7: Examples of biases in surfaced pictures.

First, although machine-tags have a very high accuracy,
they sometimes recognize objects even when they are
simply drawn or sketched, and attach semantic tags to
non-photographic images, e.g., clipart (see Figure 7c).
Non-photographic images have their own aesthetic rules that
differ substantially from those of photographs, and photo aesthetic
predictors typically give erroneous predictions on
non-photographic images. While in this work we manually
removed some non-photographic images from our corpus to
allow the model to smoothly learn photographic aesthetic
rules, an automatic pre-filtering based on non-photographic
image detectors would be advisable (Ng, Chang, and Tsui
2007).


Second, despite the high quality of the surfaced photos,
some top-ranked animals and nature images receive lower
scores than some lower-ranked ones. This behavior is due
to biases in the learning framework: some of the top-rated
images for animals and nature are extremely high-contrast
pictures (see Figure 7a), so the model wrongly over-weights
the contrast features. Similarly, some of the surfaced urban
pictures show a strong presence of contrast/median filtering,
such as the example in Figure 7b.

Last, our method is less effective in surfacing good people
images. Highly rated pictures of people often show a black
and white color palette, thus biasing the aesthetic model.
From a broader perspective, pictures of people are
different in nature from other image types. Faces grasp human
attention more than other subjects (Bakhshi, Shamma, and
Gilbert 2014): face perception is one of the most developed
human skills (Haxby, Hoffman, and Gobbini 2000), and
we have brain sub-networks dedicated to face
processing (Freiwald and Tsao 2014). Moreover, when shooting
photos of people, photographers need to capture much more
than the traits of the mere subject: people come with their
emotions, stories, and lifestyles. Portrait photography is
indeed a separate branch of traditional photography with
dedicated books and compositional techniques (Weiser 1999;
Child 2008; Hurter 2007). The traditional compositional
features that we use in our framework can only partially
capture the essence of the aesthetics of portraits.


Concluding remarks. The popularization of online
broadcast communication media, the resulting information
overload, and the consequent shrinkage of the attention span
online have shaped the Social Web increasingly towards a
frantic search for popularity that many users yearn for. In this
rampant race for fame that very few can win, the crowd often
cannot see (and sometimes tramples on) some of the
valuable gems that it creates itself. To address this in the context of
photo sharing systems, we show that it is possible to apply
computer vision techniques that spot beautiful images from
the immense and often forgotten mass of pictures in the
popularity tail. To do that, we show the necessity of using
dedicated crowdsourced beauty judgements made by common
people on common people's photos, in contrast to corpora
of professional photos annotated by professionals. We hope
that our work can be a cautionary tale about the importance
of targeting content quality instead of popularity, not just
for multimedia items but in social media at large.